Lab 4:Mapping causal & predictive approaches

Practice session

Luisa M. Mimmi   (https://luisamimmi.org/)

July 27, 2024

GOAL OF TODAY’S PRACTICE SESSION

  • Delve into causal analysis and the association between exposure and outcome
  • Visually explore causal pathways and influential variables with Directed Acyclic Graphs (DAGs):
    • Confounder: a variable that influences both the exposure and the dependent variable outcome
    • Mediator: a variable that lies on the causal path between the exposure and the outcome
    • Collider: a variable that is influenced by both the exposure and the outcome
  • Practically illustrate causal modeling in the presence of influential causal variables
    • Confounders must be adjusted for, as they distort the association between exposure and outcome (source of bias)
    • Mediators require careful consideration in analyses depending on the effect of interest (total vs. direct)
    • Colliders can introduce bias if conditioned upon, distorting the relationship by inducing spurious associations

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We may also use the packages below (specifying package::function for clarity).
# Load pckgs for this R session

# --- General 
library(here) # A Simpler Way to Find Your Files   
library(skimr)    # Compact and Flexible Summaries of Data
library(paint)   # Fancy structure for data frames 
library(janitor)  # Simple Tools for Examining and Cleaning Dirty Data
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(gt) # Easily Create Presentation-Ready Display Tables
library(marginaleffects) # Marginal Effects for Model Objects
library(WeightIt) # Weighting for Covariate Balance in Observational Studies
# --- Plotting & data visualization
library(ggfortify)     # Data Visualization Tools for Statistical Analysis Results
library(ggpubr)        # 'ggplot2' Based Publication Ready Plots
library(patchwork) # The Composer of Plots
# --- Descriptive statistics and regressions
library(broom) # Convert Statistical Objects into Tidy Tibbles
library(Hmisc) # Harrell Miscellaneous
library(estimatr) # Fast Estimators for Design-Based Inference
library(modelsummary) # Summary Tables and Plots for Statistical Models and Data:
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax
library(sandwich) #for robust variance estimation
library(survey) # Analysis of Complex Survey Samples
library(haven) # Import and Export 'SPSS', 'Stata' and 'SAS' Files
library(simstudy) # Simulation of Study Data
library(NHANES) # Data from the US National Health and Nutrition Examination Study  

DATASETS for today

Importing Dataset 1 (NHANES)

Name: NHANES, accessible from package NHANES
Documentation: See reference on the data downloaded and ….
Sampling details: We’ll use a subset of the NHANES dataset, focusing on variables related to smoking status (SmokeNow), systolic blood pressure (BPSysAve), Body Mass Index (BMI) and age (Age).

set.seed(123) # Set seed for reproducibility
data(NHANES)
# Select relevant columns and drop rows with missing values
nhanes_data <- NHANES %>%
  select(ID, Gender, Age, Race1, Height, Weight, BMI, SmokeNow, PhysActive, PhysActiveDays,
         BPSysAve, BPSysAve, BPDiaAve, TotChol, DirectChol, Diabetes, HealthGen,
         Education, HHIncomeMid, Poverty) %>%
  drop_na() %>% 
  # make it smaller 
  slice_sample(n = 1000) # Randomly select 1000 rows using

# Take a quick look at the data
#paint(nhanes_data)

Dataset 1 (NHANES) Variables and their description

EXCERPT: see complete file in Input Data Folder

Variable Type Description
ID int xxxxx
Gender chr Gender (sex) of study participant coded as male or female
Age int ##
AgeDecade chr yy-yy es 20-29
Race1 chr Reported race of study participant: Mexican, Hispanic, White, #Black, or Other
Education chr [>= 20 yro]. Ex. 8thGrade, 9-11thGrade, HighSchool, SomeCollege, or CollegeGrad.
HHIncome chr Total annual gross income for the household in US dollars
HHIncomeMid int Numerical version of HHIncome derived from the middle income # in each category. Ex. 12500 40000
Poverty dbl A ratio of family income to poverty guidelines. Smaller # numbers indicate more poverty Ex.. 0.95 1.74 4.99
Weight dbl Weight in kg
Height dbl Standing height in cm. Reported for participants aged 2 years or older.
BMI dbl Body mass index (weight/height2 in kg/m2). Reported for participants aged 2 years or older
Pulse int 60 second pulse rate
BPSysAve int Combined systolic blood pressure reading, following the # procedure outlined for BPXSAR
BPDiaAve int Combined diastolic blood pressure reading, following the # procedure outlined for BPXDAR
DirectChol dbl Direct HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
TotChol dbl Total HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
Diabetes chr Study participant told by a doctor or health professional that they have diabetes
HealthGen chr Self-reported rating of health: Excellent, Vgood, Good, Fair, or Poor Fair
PhysActive chr Participant does moderate or vigorous-intensity sports, fitness or recreational activities (Yes or No).
SmokeNow chr Study participant currently smokes cigarettes regularly. (Yes or No)

VISUALLY EXPLORE CAUSAL PATHWAYS WITH DAGS

DAG with collider

# df DAG 
library(ggdag)
dag_collid <- dagify( Y ~ X + e, Z ~ X + Y, 
                      exposure = "X",outcome = "Y", #latent = "e",
                      # labels
                      labels = c("Z" = "Collider",  "X" = "Exposure",
                                 "Y" = "Outcome", "e" = "Unobserved \nerror"),
                      # positions
                      coords = list(
                        x = c(X = 0, Y= 4, Z = 2 , e = 5),
                        y = c(X = 0, Y = 0, Z = 2, e = 1) 
                      )) %>% 
  tidy_dagitty() 

# Plot DAG 
dag_col_p <-  dag_collid %>% 
  ggdag(., layout = "circle") +  # decided in dagify
  theme_dag_blank(plot.caption = element_text(hjust = 1)) +
  # Nodes 
  geom_dag_node(aes(color = name)) +  
  # Labels dodged to avoid overlap
  geom_dag_label(aes(label = label), color = "#4c4c4c", nudge_x = 0.7, nudge_y = 0.2) +  
  
  geom_dag_text(color="white") +
  labs(title = "Causal map with COLLIDER (Z)" , 
       #subtitle = "X = exposure, Y = outcome, Z = collider, e = random error", 
       caption = ) +
  # Map colors to specific nodes
  scale_color_manual(values = c("X" = "#9b6723", "Y" = "#72aed8", "Z" = "#d02e4c",
                                "e" = "#A6A6A6"), 
                     guide = "none")    # Remove legend

dag_col_p

DAG with collider

DAG with confounder

# df DAG 
dag_confounder <- dagify(Y ~ X + Z + e, X ~  Z, 
                         exposure = "X", outcome = "Y", #latent = "e",
                         # labels
                         labels = c(
                           "Z" = "Confounder",
                           "X" = "Exposure",
                           "Y" = "Outcome",
                           "e" = "Unobserved \nerror"),
                         # positions
                         coords = list(
                           x = c(X = 0, Y= 4, Z = 2 , e = 5),
                           y = c(X = 0, Y = 0, Z = 2, e = 1) 
                         )) %>% 
  tidy_dagitty()     # Add edge_type within pipe

# Plot DAG 
dag_conf_p <-  dag_confounder %>% 
  ggdag(., layout = "circle") + 
  theme_dag_blank(plot.caption = element_text(hjust = 1)) +
  # Nodes 
  geom_dag_node(aes(color = name)) +  
  # Labels dodged to avoid overlap
  geom_dag_label(aes(label = label), color = "#4c4c4c", nudge_x = 0.7, nudge_y = 0.2) +  
  geom_dag_text(color="white") +
  labs(title = "Causal map with CONFOUNDER (Z)" , #subtitle = " ",
        caption = ) +
  # Map colors to specific nodes
  scale_color_manual(values = c("X" = "#9b6723", "Y" = "#72aed8", "Z" = "#d02e4c",
                                "e" = "#A6A6A6"), 
                     guide = "none")   
dag_conf_p

DAG with confounder

DAG with mediator

# df DAG 
dag_med <- dagify( Y ~ Z + e,  Z ~ X, exposure = "X", outcome = "Y",  #latent = "e",
                   # labels
                   labels = c(
                     "Z" = "Mediator",
                     "X" = "Exposure",
                     "Y" = "Outcome",
                     "e" = "Unobserved \nerror"),
                   # positions
                   coords = list(
                     x = c(X = 0, Y= 4, Z = 2 , e = 5),
                     y = c(X = 0, Y = 0, Z = 2, e = 1) 
                   )) %>% 
  tidy_dagitty()     # Add edge_type within pipe

# Plot DAG 
dag_med_p <-  dag_med %>% 
  ggdag(., layout = "circle") +  # decided in dagify
  theme_dag_blank(plot.caption = element_text(hjust = 1)) +
  # Nodes 
  geom_dag_node(aes(color = name)) +  
  # Labels dodged to avoid overlap
  geom_dag_label(aes(label = label), color = "#4c4c4c", nudge_x = 0.7, nudge_y = 0.2) +  
  
  geom_dag_text(color="white") +
  labs(title = "Causal map with MEDIATOR (Z)" , 
       #subtitle = "X = exposure, Y = outcome, Z = collider, e = random error", 
       caption = ) +
  # Map colors to specific nodes
  scale_color_manual(values = c("X" = "#9b6723", "Y" = "#72aed8", "Z" = "#d02e4c",
                                "e" = "#A6A6A6"), 
                     guide = "none")    # Remove legend

dag_med_p

DAG with mediator

CAUSAL MODELING WITH CONFOUNDER

DAG with confounder

Z = Age = confounder
X = SmokeNow
Y = BPSysAve (blood pressure)

Example of CONFOUNDER

  • Z = Age = confounder for the relationship between X = SmokeNow and Y = BPSysAve (blood pressure)
data(NHANES)
# Select relevant columns and drop rows with missing values
nhanes_conf <- NHANES %>%
  select(ID, Age, SmokeNow, BPSysAve) %>%
  drop_na()
  
# Take a quick look at the data
paint(nhanes_conf)
tibble [3108, 4]
ID       int 51624 51624 51624 51630 51654 51666
Age      int 34 34 34 49 66 58
SmokeNow fct No No No Yes No Yes
BPSysAve int 113 113 113 112 111 127

In this context:

  • Age influences both smoking (SmokeNow) and blood pressure (BPSysAve), making it a confounder.
  • SmokeNow directly affects BPSysAve.

Visualisation of CONFOUNDER

# linear rel 
qplot (Age, BPSysAve, data = nhanes_conf, color = SmokeNow) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot of Age and Blood Pressure by Smoking Status",
       x = "Age",
       y = "Blood Pressure (mmHg)",
       color = "Smoking Status")

Visualisation of CONFOUNDER

Regression analysis: controlling for the confounder

Unadjusted model:

# Unadjusted linear model 
model_unadjusted <- lm(BPSysAve ~ SmokeNow, data = nhanes_conf)
summary(model_unadjusted)$coefficients
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   124.38      0.432  287.97  0.0e+00
SmokeNowYes    -4.26      0.643   -6.64  3.8e-11

Adjusted model:

  • includes the confounder Age in the model
# Adjusted model
model_adjusted <- lm(BPSysAve ~ SmokeNow + Age, data = nhanes_conf)
summary(model_adjusted)$coefficients
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)  100.997     1.0886   92.78  0.00e+00
SmokeNowYes    0.684     0.6313    1.08  2.79e-01
Age            0.432     0.0187   23.09 5.19e-109

Compare models (without and with confounder)

  • Adjusting for Age changes the estimated effect of SmokeNow on blood pressure (BPSysAve):
    • in the Unadjusted Model any observed relationship between smoking and blood pressure might be confounded.
    • in the Adjusted Model (which ‘controls for age’) a more accurate estimate of the causal effect of smoking on blood pressure accounts for the confounding influence of age.
# Create the table with a custom statistic formatter
conf_sum <- modelsummary::msummary(
  list(
    "NO Confounder" = model_unadjusted, 
    "Confounder" = model_adjusted
  ),
  output = "gt",
  statistic = c(
    "conf.int",
    "s.e. = {std.error}"
  ), 
  fmt = "%.3f", 
  gof_omit = 'DF|Deviance|Log.Lik.|F|R2 Adj.|AIC|BIC|RMSE',
  stars = c('*' = .05, '**' = .01)
)

# Render the table directly
conf_sum  %>%    # text and background color
  tab_style(
    style = cell_text(weight = "bold", color = "#005ca1"),
    locations = cells_column_labels(
      columns = c("NO Confounder", "Confounder"))) %>% 
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_body(rows = c(1,4,7))) %>% 
  tab_style(style = cell_fill(color = 'lightblue'),
            locations = cells_body(rows = 4:5))
NO Confounder Confounder
(Intercept) 124.384** 100.997**
[123.537, 125.231] [98.863, 103.131]
s.e. = 0.432 s.e. = 1.089
SmokeNowYes -4.264** 0.684
[-5.524, -3.004] [-0.554, 1.922]
s.e. = 0.643 s.e. = 0.631
Age 0.432**
[0.395, 0.468]
s.e. = 0.019
Num.Obs. 3108 3108
R2 0.014 0.158
* p < 0.05, ** p < 0.01

Compare predicted values from 2 models

  • With tidyr::pivot_longer() we will reshape the dataset so we can plot the predicted values from both models in a faceted plot.
# Add predicted values from the unadjusted and adjusted models
nhanes_conf <- nhanes_conf %>%
  mutate(
    pred_unadjusted = predict(model_unadjusted),  # Predicted BPSysAve from unadjusted model
    pred_adjusted = predict(model_adjusted)       # Predicted BPSysAve from adjusted model
  )

# Reshape the data into long format using pivot_longer()
nhanes_long <- nhanes_conf %>%
  pivot_longer(cols = c(pred_unadjusted, pred_adjusted), 
               names_to = "model", values_to = "predicted") %>%
  mutate(model = recode(model, 
                        pred_unadjusted = "Unadjusted: SmokeNow on BPSysAve", 
                        pred_adjusted = "Adjusted: SmokeNow + Age on BPSysAve"))

Plot predicted values from 2 models

# Violin plot with dashed line connecting the two conditions
ggplot(nhanes_long, aes(x = factor(SmokeNow), y = BPSysAve, fill = model)) +
  # Create the violin plot to show the distribution of blood pressure
  geom_point(color = "grey", alpha = 0.4, position = position_dodge(width = 0.3) ) +
  # Overlay the predicted dashed line between the two smoking conditions
  geom_line(aes(y = predicted, group = model, color = model), 
            size = 0.8, linetype = "dashed", position = position_dodge(width = 0.3)) +
  # Facet vertically for unadjusted and adjusted models
  facet_wrap(model ~ ., scales = "free_y",ncol = 2) +
  # Add labels and title
  labs(title = "Confounder Analysis: Predicted values in Unadjusted vs Adjusted Regression models",
       subtitle = "Age = Confounder",
       x = "Smoking Status",
       y = "Systolic Blood Pressure (BPSysAve)",
       color = "Model",
       fill = "Model"
       ) +
  # Customize colors for better contrast
  scale_fill_manual(values = c("Unadjusted: SmokeNow on BPSysAve" ="lightcoral" , 
                               "Adjusted: SmokeNow + Age on BPSysAve" = "lightblue")) +
  scale_color_manual(values = c("Unadjusted: SmokeNow on BPSysAve" = "red", 
                                "Adjusted: SmokeNow + Age on BPSysAve" = "blue")) +
  # Minimal theme for clarity
  theme_minimal()+
  theme(legend.position = "none")

Plot predicted values from 2 models

CAUSAL MODELING WITH MEDIATOR

DAG with mediator

M = BMI is a mediator
X = PhysActiveDays
Y = BPSysAve (blood pressure)

The case of a mediator

  • M = BMI is a mediator for the effect of X = PhysActiveDays on Y = BPSysAve
data(NHANES)
# Select relevant columns and drop rows with missing values
nhanes_med <- NHANES %>%
  select(Gender, Age, BPSysAve, BMI, PhysActiveDays, Diabetes, SmokeNow) %>%
  drop_na()

# Take a quick look at the data
summary(nhanes_med)
    Gender         Age          BPSysAve        BMI       PhysActiveDays
 female:632   Min.   :20.0   Min.   : 81   Min.   :15.0   Min.   :1.0   
 male  :831   1st Qu.:34.0   1st Qu.:111   1st Qu.:23.6   1st Qu.:2.0   
              Median :47.0   Median :119   Median :27.0   Median :3.0   
              Mean   :47.8   Mean   :122   Mean   :27.9   Mean   :3.6   
              3rd Qu.:60.0   3rd Qu.:131   3rd Qu.:31.3   3rd Qu.:5.0   
              Max.   :80.0   Max.   :221   Max.   :59.1   Max.   :7.0   
 Diabetes   SmokeNow 
 No :1312   No :834  
 Yes: 151   Yes:629  
                     
                     
                     
                     

Fitting unadjusted (total) model

Before adjusting for any mediator, we fit a simple linear model to estimate the total effect of PhysActiveDays on Systolic Blood Pressure (disregarding any possible mediation of BMI).

  • Total Effect Model (Unadjusted)
    • This model gives the total effect of of PhysActiveDays on Systolic Blood Pressure (including any indirect effects through M = BMI or other variables that haven’t been included.)
# Unadjusted model: total effect of smoking on blood pressure
total_effect_model <- lm(BPSysAve ~ PhysActiveDays, data = nhanes_med)
summary(total_effect_model)$coefficients
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      120.12      1.055  113.88   0.0000
PhysActiveDays     0.59      0.262    2.25   0.0246

Set up the mediation models 1/2

Then, we fit the model to estimate the effect of X = PhysActiveDays on M = BMI:

  • Mediator model
    • This model allows us to see whether more physically active days are associated with (lower) BMI values (which could then affect blood pressure).
# Mediator model: SmokeNow affects BMI
mediator_model <- lm(BMI ~ PhysActiveDays, data = nhanes_med)
summary(mediator_model)$coefficients
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     28.2176     0.3444  81.934    0.000
PhysActiveDays  -0.0821     0.0856  -0.959    0.338

Set up the mediation models 2/2

Lastly, we fit the model to estimate the effect of X = PhysActiveDays on Y = BPSysAve while adjusting for M = BMI:

  • Outcome Model (Adjusted for BMI)
    • This model estimates the effect of PhysActiveDays and BMI on Systolic Blood Pressure (BPSysAve). This is the adjusted model.
# Outcome model: PhysActiveDays and BMI affect BPSysAve
outcome_model <- lm(BPSysAve ~  PhysActiveDays + BMI, data = nhanes_med)
summary(outcome_model)$coefficients
               Estimate Std. Error t value  Pr(>|t|)
(Intercept)     109.353     2.4763   44.16 3.15e-271
PhysActiveDays    0.621     0.2603    2.39  1.71e-02
BMI               0.382     0.0795    4.80  1.77e-06

Compare 3 models

  • The Total (unadjusted) effect model estimates the effect of PhysActiveDays on BPSysAve.
  • The Mediator model estimates the effect of PhysActiveDays on BMI (M).
  • The Outcome model estimates the effect of (PhysActiveDays + BMI) on BPSysAve.
Tot (unadj) Mod.
Med Mod.
Outcome Mod.
BPSysAve ~ Total Effect BMI ~ Mediator Effect BPSysAve ~ Outcome
(Intercept) 120.120** 28.218** 109.353**
[118.051, 122.189] [27.542, 28.893] [104.495, 114.210]
s.e. =1.055 s.e. =0.344 s.e. =2.476
PhysActiveDays 0.590* -0.082 0.621*
[0.076, 1.104] [-0.250, 0.086] [0.111, 1.132]
s.e. =0.262 s.e. =0.086 s.e. =0.260
BMI 0.382**
[0.226, 0.538]
s.e. =0.080
Num.Obs. 1463 1463 1463
R2 0.003 0.001 0.019
* p < 0.05, ** p < 0.01

Calculate the Indirect, Direct, and Total Effects (1/2)

Using the coefficients from the models, we can calculate the mediation effects manually.

  • Direct Effect: This is the coefficient of PhysActiveDays in the outcome model, which gives the direct effect of exercise on blood pressure (controlling for BMI = M)
  • Indirect Effect: This is the product of the coefficient from the mediator model (PhysActiveDaysBMI) and the coefficient from the outcome model (BMIBPSysAve).
  • Total Effect: This is the coefficient of PhysActiveDays in the unadjusted model (NOT controlling for BMI = M).

Calculate the Indirect, Direct, and Total Effects (2/2)

Breakdown of the estimated effects:

# TOTAL effect: The effect of PhysActiveDays on BPSysAve without adjusting for BMI
total_effect <- coef(total_effect_model)["PhysActiveDays"]# 0.59
total_effect
PhysActiveDays 
          0.59 
# DIRECT effect: The effect of PhysActiveDays on BPSysAve after adjusting for BMI.
direct_effect <- coef(outcome_model)["PhysActiveDays"]# 0.621 
direct_effect
PhysActiveDays 
         0.621 
# Indirect effect: The portion of the effect on BPSysAve that is mediated through BMI
indirect_effect <- coef(mediator_model)["PhysActiveDays"] * coef(outcome_model)["BMI"] # [-0.0821 X 0.382]  
indirect_effect
PhysActiveDays 
       -0.0313 
# Proportion mediated: The proportion of the total effect that is mediated through BMI
proportion_mediated <- glue::glue("{round(indirect_effect / total_effect *100, 2)}%")
proportion_mediated #  -0.0531
-5.31%

_ QUI _______

Visualise the Indirect, Direct, and Total Effects

“+ Physical Activity” -> “+ Systolic Blood Pressure (BPSysAve)” | “+ BMI -> - BPSysAve” WTF ???

# Create a data frame for the effects
effects_df <- data.frame(
  Effect = c("Total Effect", "Direct Effect", "Indirect Effect"),
  Value = c(total_effect, direct_effect, indirect_effect)
)

# Create the bar plot
ggplot(effects_df, aes(x = Effect, y = Value, fill = Effect)) +
  geom_bar(stat = "identity") +
  labs(title = "Mediation Analysis: Total, Direct, and Indirect Effects", 
       subtitle = "Effect (coefficient) of Physical Activity on Systolic Blood Pressure (BPSysAve)",
       y = "", x = "") +
  theme_minimal()

Interpret results

  • The total effect tells us the overall relationship between PhysActiveDays and BPSysAve.

  • The direct effect represents the relationship between PhysActiveDays and BPSysAve, controlling for BMI.

  • The indirect effect (mediated effect) shows how much of the relationship between PhysActiveDays and BPSysAve is explained through BMI (AKA the coef of PhysActiveDays in the mediator model times the coef of BMI in the outcome model)

  • The proportion mediated indicates how much of the total effect is due to the mediation by BMI.

Calculate the Indirect, Direct, and Total Effects (modo2)

library(mediation)
# recall
mediator_model[["call"]][["formula"]]
BMI ~ PhysActiveDays
outcome_model[["call"]][["formula"]]
BPSysAve ~ PhysActiveDays + BMI
# Calculate the mediation effect
mediation_model <- mediate(mediator_model, outcome_model, 
                           treat = "PhysActiveDays", 
                           mediator = "BMI", 
                           boot=TRUE, sims=500)
summary(mediation_model)

Causal Mediation Analysis 

Nonparametric Bootstrap Confidence Intervals with the Percentile Method

               Estimate 95% CI Lower 95% CI Upper p-value  
ACME            -0.0313      -0.1067         0.03   0.256  
ADE              0.6213       0.1246         1.16   0.016 *
Total Effect     0.5899       0.0576         1.10   0.020 *
Prop. Mediated  -0.0531      -0.5263         0.06   0.276  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sample Size Used: 1463 


Simulations: 500 
#                Estimate 95% CI Lower 95% CI Upper p-value  
# ACME            -0.0313      -0.1024         0.04   0.324     = indirect or mediation effect
# ADE              0.6213       0.1479         1.16   0.024 *   = direct effect 
# Total Effect     0.5899       0.1354         1.15   0.020 *   = total effect 
# Prop. Mediated  -0.0531      -0.2969         0.10   0.344     = proportion mediated
# Add predicted values for BMI, BPSysAve (adjusted), and BPSysAve (total effect)
nhanes_med_pred <- nhanes_med %>%
  mutate(
    pred_bmi = predict(mediator_model),         # Predicted BMI (mediator model)
    pred_bps_adj = predict(outcome_model),      # Predicted BPSysAve (adjusted for Age)
    pred_bps_total = predict(total_effect_model) # Predicted BPSysAve (total effect)
  )

Plot results 1/2

Now, let’s create a plot that includes:

  • The total effect of BMI on blood pressure (without adjusting for Age).
  • The adjusted effect of BMI on blood pressure (controlling for Age).
  • The mediator effect (BMI’s effect on Age).
# Check column names
colnames(nhanes_med_pred)
 [1] "Gender"         "Age"            "BPSysAve"       "BMI"           
 [5] "PhysActiveDays" "Diabetes"       "SmokeNow"       "pred_bmi"      
 [9] "pred_bps_adj"   "pred_bps_total"
# Reshape the data into long format for faceting
nhanes_long <- nhanes_med_pred %>%
  gather(key = "model", value = "predicted",  pred_bps_adj, pred_bps_total) %>%
  mutate(model = recode(model, 
                        #pred_bmi = "Mediator Effect: BPSysAve on BMI", 
                        pred_bps_adj = "Adjusted Effect: PhysActiveDays + BMI on BPSysAve", 
                        pred_bps_total = "Total Effect: PhysActiveDays on BPSysAve"))

Plot results 2/2

# Plot with vertically aligned facets
ggplot(nhanes_long, aes(x = BMI)) +
  geom_point(aes(y = BPSysAve), color = "black", shape = 16, alpha = 0.5) +  # Actual data points for BPSysAve
  geom_line(aes(y = predicted, color = model), size = 1) +  # Predicted values from different models
  facet_grid(model ~ ., scales = "free_y") +  # Facet vertically
  labs(title = "Mediation Analysis: Total, Adjusted, and Mediator Effects",
       x = "BMI",
       y = "Outcome / Mediator",
       color = "Effect Type") +
  theme_minimal()

Plot results 2/2

—-

QUI

Equitable Equations !!!!! https://www.andrewheiss.com/blog/2024/03/21/demystifying-ate-att-atu/

Importing Dataset 2 (Gabor)

Name:
Documentation: .. Sampling details:

# import data
data <- read_csv(paste0(data_in,"food-health.csv")) # 164,848 

# drop observations
dat <-  data %>%
  filter(age >= 30 & age <60) %>% # 7930 
  # new variables: 
  ## Fruit and vegetables per day (grams)
  ## Blood pressure (systolic+diastolic)
  mutate(fv=veggies_n_fruits_gr,
         bp=blood_pressure) %>%
  filter(fv<3200) %>%
  drop_na(bp) %>% # 7358
  # Days per week exercising 
  mutate(exerc=case_when(paq655 <=7 ~ paq655,
                         paq650 ==2 ~ 0)) %>% 
  # Potato chips per day, grams
  mutate(pchips=gr_potato_chips) %>% 
# select variables
  select(id = seqn, age, gender, weight, height, race, 
         fv, exerc, pchips, 
         bp_systolic, bp_diastolic, bp, bmi,  
         smoker, total_cholesterol, hdl,
         hh_income_usd, hh_income_percap, ln_hh_income_percap ,income_cat
                  ) %>% 
  drop_na() # 7358
 

# Hmisc::describe(dat$hh_income)
paint(dat)

Dataset 2 (Gabor) Variables and their description

[EXCERPT: see complete file in Input Data Folder]

https://bookdown.org/jbrophy115/bookdown-clinepi/confounding.html

Causal modeling with CONFOUNDER

# Select relevant columns and drop rows with missing values
dat_conf <- dat %>%
  select(id, age, bp_systolic, fv, exerc, bmi, smoker, gender, hh_income_usd, ln_hh_income_percap ) %>%
  drop_na() %>% 
  filter(fv <=1500)

# Take a quick look at the data
summary(dat_conf)

Hmisc::describe(dat_conf$bp_systolic) # Y
Hmisc::describe(dat_conf$bmi) # X
Hmisc::describe(dat_conf$age) # Z (confounder )
Hmisc::describe(dat_conf$exerc)  # Z (confounder )
Hmisc::describe(dat_conf$ln_hh_income_percap) # Z (confounder )

In this context:

  • exerc influences both smoking (fv) and blood pressure (bp_systolic), making it a confounder (albeit as PROXY for socio-economic status)
  • fv directly affects bp_systolic

— Regression analysis

# Unadjusted model
model_unadjusted <- lm( bp_systolic ~ bmi + ln_hh_income_percap , data = dat_conf)
summary(model_unadjusted)

# Adjusted model
model_adjusted <- lm(bp_systolic ~ bmi + age + ln_hh_income_percap, data = dat_conf)
summary(model_adjusted)
  • adjusting for Z changes the estimated effect of X on Y
    • Unadjusted Model: This model does not control for Z, which is a confounder of the relationship between X and Y.
    • Adjusted Model: This model controls for Z, providing a more accurate estimate of the causal effect of X on Y.

— Add Predicted Values to the Dataset

model_unadjusted$coefficients 
model_adjusted$coefficients

# Get predicted values using augment and attach to the original dataset
dat_unadjusted <- broom::augment(model_unadjusted, data = dat_conf) %>%
  select( -.resid,  -.hat, -.sigma, -.cooksd, -.std.resid ) %>%
  rename(pred_unadjusted = .fitted)

dat_adjusted <- broom::augment(model_adjusted, data = dat_conf) %>%
  select( -.resid,     -.hat, -.sigma, -.cooksd, -.std.resid ) %>%
  rename(pred_adjusted = .fitted)

# Combine the predicted values from both models
dat_conf_long <- dat_unadjusted %>%
  left_join(dat_adjusted, by = "id", suffix=c("",".y"))%>%
  select(-ends_with(".y")) %>% 
  # Convert the data from wide to long format
  pivot_longer(cols = c(pred_unadjusted, pred_adjusted), 
               names_to = "model", 
               values_to = "predicted")

# Check the structure of the long dataset
str(dat_conf_long)

— Plot results 1/2

  • Reshape the Data Using pivot_longer()

We will reshape the dataset so we can plot the predicted values from both models in a faceted plot.

# check 
dat_conf_long %>% group_by(model) %>% 
  summarise(mean(predicted), sd(predicted))

— Plot results 2/2

ggplot(dat_conf, aes(x = bmi, y = bp_systolic)) +
  geom_point(color = "grey", alpha = 1) +  
  geom_point(data = dat_unadjusted, aes(x = bmi, y = pred_unadjusted), 
             color = "#e60066", alpha = 0.20) +
  geom_point(data = dat_adjusted, aes(x = bmi, y = pred_adjusted), 
             color = "#399B23", alpha = 0.20) +
  # Add labels and title
  labs(title = "Confounder Analysis: Unadjusted (pink) vs Adjusted (green) fitted values",
       x = "BMI",
       y = "Systolic Blood Pressure (BPSysAve)"
       )  +
    # Minimal theme for clarity
  theme_minimal() +
  theme(legend.position = "none")

— Plot results 2/2

— Different

https://nicholasrjenkins.science/post/data_viz_r/data_visualization_r/

— Regression analysis

# Unadjusted model
model_unadjusted <- lm( bp_systolic ~ bmi  , data = dat_conf)  
 # Adjusted model
model_adjusted <- lm(bp_systolic ~ bmi + age  , data = dat_conf)  
tidy(model_unadjusted, conf.int = TRUE ) %>% 
  ggplot(data = ., 
         mapping = aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
  geom_pointrange() +
  labs(title = "Model Estimates of bmi on bp_systolic",
       x = "Coefficient Estimate",
       y = "Predictor",
       caption = "Models fit with OLS. Error bars show the 95% confidence interval.") +
  scale_y_discrete(labels = c("Intercept",  "bmi")) +
  ggpubr::theme_pubclean(flip = TRUE)
tidy(model_adjusted, conf.int = TRUE ) %>% 
  ggplot(data = ., 
         mapping = aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
  geom_pointrange() +
  labs(title = "Model Estimates of bmi + age on bp_systolic",
       x = "Coefficient Estimate",
       y = "Predictor",
       caption = "Models fit with OLS. Error bars show the 95% confidence interval.") +
  scale_y_discrete(labels = c("Intercept", "age", "bmi")) +
  ggpubr::theme_pubclean(flip = TRUE)
fit_unadj_results <- tidy(model_unadjusted, conf.int = TRUE) %>% 
  mutate(model = "Model Unadjusted")

fit_adj_results <- tidy(model_adjusted, conf.int = TRUE) %>% 
  mutate(model = "Model Adjusted")

model_results <- bind_rows(fit_unadj_results, fit_adj_results)
ggplot(data = model_results, 
       aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high, 
           color = model, shape = model)) +
  geom_pointrange(position = position_dodge(width = 0.5)) +
  labs(title = "Model Estimates of BMI and age on REM Sleep",
       x = "Coefficient Estimate",
       y = "Predictor",
       caption = "Models fit with OLS. Error bars show the 95% confidence interval.",
        color = "Model:",
       shape = "Model:") +
  scale_y_discrete(labels = c("Intercept", "age", "bmi")) +
  ggpubr::theme_pubclean(flip = FALSE)

CALCULATING OUTCOME EFFECTS



The explanation and examples below follow very closely two sources:

  1. Noah Greifer, Elizabeth A. Stuart, Choosing the Causal Estimand for Propensity Score Analysis of Observational Studies
  2. Andrew Heiss, Demystifying causal inference estimands: ATE, ATT, and ATU

Recalling some vocabulary

  • \(X\) = treatment or intervention
  • \(Y^1\) = potential outcome if treated
  • \(Y^0\) = potential outcome if not treated
  • \(\delta\) = difference \((Y^1 - Y^0)\), or the individual-level causal effect.

EXAMPLE with toy case

basic_po <- tribble(
  ~id, ~age,    ~treated, ~outcome_1, ~outcome_0,
  1,   "Old",   1,        80,         60,
  2,   "Old",   1,        75,         70,
  3,   "Old",   1,        85,         80,
  4,   "Old",   0,        70,         60,
  5,   "Young", 1,        75,         70,
  6,   "Young", 0,        80,         80,
  7,   "Young", 0,        90,         100,
  8,   "Young", 0,        85,         80
) |> 
  dplyr::mutate(
    ice = outcome_1 - outcome_0,
    outcome = ifelse(treated == 1, outcome_1, outcome_0)
  )

ITE in example

  • Each person has two potential outcomes and an individual causal effect (\(ITE_i= Y_{i}^1 − Y_{i}^0\) = \(\delta_i\)).
  • \(\delta_i\) it’s only measurable with a time machine or some way to observe parallel universes.
Table 1: Standard table showing potential outcomes, individual causal effects, and realized outcomes
Confounder
Treatment
Unobservable
Realized
ID
Age
Treated
Potential outcomes
ITE or \(\delta_i\)*
Outcome
(Z_i) (X_i) (Y^1_i) (Y^0_i) (Y^1_i - Y^0_i) (Y_i)
1 Old 1 80 60 20 80
2 Old 1 75 70 5 75
3 Old 1 85 80 5 85
4 Old 0 70 60 10 60
5 Young 1 75 70 5 75
6 Young 0 80 80 0 80
7 Young 0 90 100 −10 100
8 Young 0 85 80 5 80
* ITE = individual causal effect

ATE = Average Treatment Effect

If we could compare all these people in Universe A and Universe B and measure their individual causal effects, we could calculate the Average Treatment Effect (ATE), or the effect of the intervention across the whole population. Officially, the ATE is the average of the \(\delta\) column, or the average of \(Y^1 - Y^0\):

\[ \begin{aligned} \text{ATE} &= E[\delta_i] & \text{ or} \\ \text{ATE} &= E[Y^1_i - Y^0_i] \end{aligned} \]

Given the data in Table 1, the ATE is 5—it’s the average of the \(\delta\) column:

\[ \text{ATE} = \frac{20 + 5 + 5 + 5 + 10 + 0 + -10 + 5}{8} = 5 \]

ATT = Average Treatment Effect on Treated

There are a couple other causal effects we can measure. We might want to know how big the effect is just for those who received the treatment. This is called the Average Treatment effect on the Treated (ATT), and is the average of the individual causal effects just among those who were treated. Officially, it’s defined like this:

\[ \begin{aligned} \text{ATT} &= E[\delta_i \mid X_i = 1] & \text{or} \\ \text{ATT} &= E[Y^1_i - Y^0_i \mid X_i = 1] \end{aligned} \]

In this case, that means we’re only looking at the average \(\delta\) for rows 1, 2, 3, and 5 in Table 1:

\[ \text{ATT} = \frac{20 + 5 + 5 + 5}{4} = 8.75 \]

ATU = Average Treatment Effect on Untreated

We can also calculate the Average Treatment effect on the Untreated (ATU; sometimes called the ATC (for effect on the control group)) by finding the average of the individual causal effects among those who were not treated:

\[ \begin{aligned} \text{ATU} &= E[\delta_i \mid X_i = 0] & \text{or} \\ \text{ATU} &= E[Y^1_i - Y^0_i \mid X_i = 0] \end{aligned} \]

Here, we’re only looking at the average \(\delta\) for rows 4, 6, 7, and 8 in Table 1:

\[ \text{ATU} = \frac{10 + 0 - 10 + 5}{4} = 1.25 \]

Relation among ATE, ATT and ATU

In fact, the ATE is technically a weighted average of the ATT and the ATU (here \(\pi\) indicates “proportion”):

\[ \text{ATE} = (\pi_\text{Treated} \times \text{ATT}) + (\pi_\text{Untreated} \times \text{ATU}) \]

In our example, Table 1, we can get the same ATE:

\[ \begin{aligned} \text{ATE} &= (\frac{4}{8} \times 8.75) + (\frac{4}{8} \times 1.25) \\ &= 4.375 + 0.625 \\ &= 5 \end{aligned} \]

Decomposing the ATE into these two parts like this shows that there’s some systematic difference between the treated and untreated people. This difference is called selection bias.

We can see the selection bias in an alternative decomposition of the ATE:

\[ \text{ATE} = \text{ATT} + \text{Selection bias} \]

The fact that the ATT (8.75) is bigger than the ATE (5) here is a sign the two groups are different. This intervention, whatever it is, has a big effect on the treated people who signed up for it, likely because they somehow knew that the intervention would be helpful for them.

Those who were untreated have a really low ATU (1.25)—they likely didn’t sign up for the intervention because they knew that it wouldn’t do much for them.

(4/8*20) + (4/8*(-11.667))
[1] 4.17

Relation among ATE, ATT and ATU (R)

effect_types <- basic_po |> 
  dplyr::group_by(treated) |> 
  dplyr::summarize(
    effect = mean(ice),
    n = n()
  ) |> 
  dplyr::mutate(
    prop = n / sum(n),
    weighted_effect = effect * prop
  ) |> 
  dplyr::mutate(estimand = case_match(treated, 0 ~ "ATU", 1 ~ "ATT"), .before = 1)

effect_types
# A tibble: 2 × 6
  estimand treated effect     n  prop weighted_effect
  <chr>      <dbl>  <dbl> <int> <dbl>           <dbl>
1 ATU            0   1.25     4   0.5           0.625
2 ATT            1   8.75     4   0.5           4.38 
# # A tibble: 2 × 6
#   estimand treated effect     n  prop weighted_effect
#   <chr>      <dbl>  <dbl> <int> <dbl>           <dbl>
# 1 ATU            0   1.25     4   0.5           0.625
# 2 ATT            1   8.75     4   0.5           4.38 

effect_types |> 
  dplyr::summarize(ATE = sum(weighted_effect))
# A tibble: 1 × 1
    ATE
  <dbl>
1     5
# # A tibble: 1 × 1
#     ATE
#   <dbl>
# 1     5

ITE = Simulation

def <- simstudy::defData(varname = "C", formula = 0.4, dist = "binary")
def <- simstudy::defData(def, "X", formula = "0.3 + 0.4 * C", dist = "binary")
def <- simstudy::defData(def, "e", formula = 0, variance = 2, dist = "normal")
def <- simstudy::defData(def, "Y0", formula = "2 * C + e", dist="nonrandom")
def <- simstudy::defData(def, "Y1", formula = "1 + 2 * C + e", dist="nonrandom")
def <- simstudy::defData(def, "Y_obs", formula = "Y0 + (Y1 - Y0) * X", dist = "nonrandom") #  = Y1*X + (1-X)*Y0

set.seed(2017)
dt <- genData(10000, def)
paste0("Calculated mean causal effect = ", mean(dt[, Y1] - dt[, Y0])) #mean difference of counterfactual outcomes
[1] "Calculated mean causal effect = 1"
paste0("Calculated observed difference  = ", round(dt[X == 1, mean(Y_obs)] - dt[X == 0, mean(Y_obs)],2))
[1] "Calculated observed difference  = 1.71"
lm1 <- lm(Y_obs ~ X, data = dt)
tidy(lm1)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.481    0.0225      21.3 8.84e-99
2 X              1.71     0.0335      51.1 0       

Final thoughts

  • …..